Introduction

This graphical summary has the following sections:

  • Preprocessing
  • Topic modeling
  • Document clustering across topic hierarchy
  • Topic hierarchy visualization

Preprocessing

Steps

  • Full texts were collected into a local Zotero library using the Zotero Connector.
  • Text was extracted using PyPDF2.
  • Punctuation was removed using nltk and regular expressions.
  • Texts were tokenized using nltk.
  • Tokens that occurred in a document fewer than 3 or more than 950 times were removed, as suggested in Khodorchenko, M. et al. (2020).
  • Additionally, tokens in which a single two-character combination made up more than 50% of the token length were removed.
    • Manual inspection of the pre-processed dataset showed that this step helps to reduce the number of uninformative tokens
      (see also Figures 1 and 2).
  • Stop words were removed using the nltk stop word list, extended to reduce the number of uninformative terms.
  • Co-occurrence counts were calculated for both datasets using a custom Python script with a window size of 10 tokens.
  • The corpus and co-occurrence counts were saved in a format accepted by the BigARTM library and used to construct hierarchical topic models.
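
The frequency filter and the windowed co-occurrence counting can be illustrated with a minimal sketch. The custom script itself is not shown in this summary, so the function names, the parameter names (min_count, max_count, window) and the choice to count unordered pairs are assumptions made for illustration only. The window is read here as "two tokens fall within a span of 10 consecutive tokens".

```python
from collections import Counter


def filter_by_frequency(tokens, min_count=3, max_count=950):
    """Keep tokens of one document whose count lies in [min_count, max_count]."""
    counts = Counter(tokens)
    return [t for t in tokens if min_count <= counts[t] <= max_count]


def cooccurrence_counts(tokens, window=10):
    """Count unordered pairs of distinct tokens that share a span of `window` tokens.

    Assumption: a pair is counted once for every position pair (i, j)
    with 0 < j - i < window.
    """
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Hypothetical toy document, not taken from either dataset.
doc = ["read", "genome", "read", "alignment", "read", "genome"]
kept = filter_by_frequency(doc, min_count=2, max_count=950)
pair_counts = cooccurrence_counts(kept, window=3)
```

Writing the counts in the format expected by BigARTM is left out of the sketch, because that step depends on the library version and on the corpus format in use.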

Comments on grid plot structure

  • The grid plots show the number of tokens by the maximum observed fraction
    of token length that consisted of the same two-character combination.
  • Each row shows the result of splitting a dataset at a given fraction-of-length threshold.
  • The first column shows the counts for tokens above the threshold (termed Noise here).
  • The second column shows the counts for tokens below the threshold (termed Clean here).
  • The third column shows the word cloud plot for the "Noise" tokens.
  • The fourth column shows the word cloud plot for the "Clean" tokens.
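
The Noise/Clean split can be sketched as follows. The exact definition of the fraction is not given in this summary, so the sketch makes two assumptions: occurrences of a two-character combination are counted without overlap, and a combination only counts as repeated when it occurs at least twice. The names max_dimer_fraction and split_by_dimer_threshold are hypothetical.

```python
def max_dimer_fraction(token):
    """Largest fraction of the token length covered by one repeated
    two-character combination (0.0 if nothing repeats)."""
    if len(token) < 4:
        return 0.0  # too short to hold two non-overlapping copies
    dimers = {token[i:i + 2] for i in range(len(token) - 1)}
    best = 0
    for dimer in dimers:
        occurrences = token.count(dimer)  # str.count is non-overlapping
        if occurrences >= 2:
            best = max(best, 2 * occurrences)
    return best / len(token)


def split_by_dimer_threshold(tokens, threshold=0.5):
    """Split tokens into (noise, clean) lists at the given fraction threshold."""
    noise = [t for t in tokens if max_dimer_fraction(t) > threshold]
    clean = [t for t in tokens if max_dimer_fraction(t) <= threshold]
    return noise, clean


# Hypothetical tokens chosen for illustration only.
noise, clean = split_by_dimer_threshold(
    ["genome", "atatatat", "alignment", "cgcgcg"], threshold=0.5
)
```

Under these assumptions a heuristic of this kind can also flag ordinary words with repeated syllables, which is one reason the threshold is inspected visually in the grid plots.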

Figure 1. Repeated dimer filtering - natural language processing dataset

Figure 2. Repeated dimer filtering - bioinformatics dataset

Topic modeling

Document clustering across topic hierarchy

  • The hierarchical topic model based on Chirkova, N.A., 2016 makes it possible to calculate a topic distribution for each document in the corpus.
  • Each such vector corresponds to a discrete probability distribution and can be used to compare documents at a given level of the topic hierarchy, similarly to neural network embeddings.
  • Additionally, for each level of the topic hierarchy (except the first one) it is possible to obtain vectors representing super-topics in terms of sub-topics. This makes it possible to treat super-topics (at the higher level of the hierarchy) as pseudo-documents and include them in the document matrix. This is the basis of the approach to calculating the topic hierarchy described by Chirkova, N.A., 2016.
  • Calculating the Hellinger distance between such vectors (documents and pseudo-documents) was suggested as one of the quality metrics for the model in the original publication.
  • Here an attempt was made to generalize this approach by implementing document similarity calculation in three additional steps:
    • First, given a matrix of topic-based document distributions (termed Phi) and a matrix of pseudo-document distributions (termed Psi), a combined matrix is generated.
    • Next, the square pairwise distance matrix is calculated from this combined matrix using the Hellinger distance formula.
    • Finally, the distance matrix is converted to a similarity matrix using the Bhattacharyya coefficient, as discussed in Kitsos, C.P. and Nisiotis, C.-S. (2022).
    • Using the resulting similarity matrix, it was possible to perform spectral clustering of documents to assign groups of topic-based similarity within each level of the topic hierarchy.
    • It was also possible to represent documents at each level of the hierarchy visually in two-dimensional space by applying multidimensional scaling to the original distance matrix to generate a scatter plot.
    • This plot was additionally annotated with document labels and with connections between documents whose pairwise Hellinger distance was below the specified threshold. The resulting plots are shown in Figures 4 and 7.
    • Tables illustrate the correspondence between the maximum-probability topic index and the cluster.
    • Sankey plots show the discrepancies between the spectral clustering results and the topic labels with the highest probabilities at a given layer.
    • The human-in-the-loop evaluation of actual document similarity within the discovered groups is yet to be conducted.
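
The three steps (combined matrix, pairwise Hellinger distances, conversion to a Bhattacharyya similarity) can be sketched in plain Python. This is an illustration under stated assumptions rather than the code used for the figures: the toy distributions and all function names are hypothetical, and each row is assumed to be a probability distribution over the topics of one level. The conversion relies on the identity H^2 = 1 - BC for discrete distributions, where H is the Hellinger distance and BC the Bhattacharyya coefficient.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def pairwise_hellinger(rows):
    """Square matrix of Hellinger distances between all pairs of rows."""
    return [[hellinger(p, q) for q in rows] for p in rows]


def bhattacharyya_similarity(distances):
    """Convert Hellinger distances to Bhattacharyya coefficients (BC = 1 - H^2)."""
    return [[1.0 - d * d for d in row] for row in distances]


# Hypothetical toy data: three documents and one super-topic pseudo-document,
# each row being a distribution over the same three topics.
documents = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
pseudo_documents = [[0.5, 0.5, 0.0]]

combined = documents + pseudo_documents          # step 1: combined matrix
distances = pairwise_hellinger(combined)         # step 2: Hellinger distances
similarity = bhattacharyya_similarity(distances) # step 3: similarity matrix
```

A similarity matrix of this form is symmetric, has ones on the diagonal and takes values in [0, 1], which is the shape expected by clustering routines that accept a precomputed affinity matrix (for example scikit-learn's SpectralClustering with affinity="precomputed"). The distance matrix can be passed unchanged to a multidimensional scaling routine that accepts precomputed dissimilarities.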

Topic hierarchy visualization

  • To show the resulting hierarchy and the connections discovered by the model, an additional function was developed that represents all levels of the topic hierarchy and includes any connections between the layers with a model-assigned probability value below a specified threshold.
  • The resulting plots are shown in Figures 5 and 8.

Results for BIOIT set

CPU times: user 18.3 s, sys: 2.8 s, total: 21.1 s
Wall time: 13.8 s
level0

topic_0:  ['data', 'sequencing', 'analysis', 'cells', 'cell', 'dna', 'cancer', 'methods', 'used', 'gene']
topic_1:  ['reads', 'genome', 'read', 'data', 'alignment', 'reference', 'variant', 'sequencing', 'genomes', 'coverage']

level1

topic_0:  ['sequencing', 'dna', 'cancer', 'resistance', 'gene', 'ngs', 'genes', 'using', 'detection', 'protein']
topic_1:  ['variant', 'kraken', 'variants', 'regions', 'normalization', 'benchmark', 'species', 'scone', 'snps', 'wgs']
topic_2:  ['reads', 'genome', 'read', 'alignment', 'assembly', 'coverage', 'genomes', 'reference', 'contigs', 'graph']
topic_3:  ['data', 'analysis', 'cell', 'cells', 'methods', 'metagenomic', 'used', 'expression', 'nat', 'metagenomics']

level2

topic_0:  ['alignment', 'bioinformatics', 'tools', 'algorithms', 'umap', 'mapping', 'short', 'length', 'algorithm', 'fuzzy']
topic_1:  ['genomes', 'contigs', 'lineage', 'contig', 'tree', 'assemblies', 'supplementary', 'assigned', 'samples', 'grapetree']
topic_2:  ['cell', 'cells', 'methods', 'expression', 'clustering', 'number', 'dataset', 'model', 'clusters', 'scvis']
topic_3:  ['variant', 'variants', 'ngs', 'wgs', 'normalization', 'scone', 'calling', 'depth', 'performance', 'lrs']
topic_4:  ['mash', 'sketch', 'aligner', 'fastp', 'hash', 'mappings', 'quality', 'size', 'mapq', 'file']
topic_5:  ['species', 'learning', 'tumor', 'detection', 'deep', 'genomics', 'liquid', 'circulating', 'pubmed', 'patients']
topic_6:  ['metagenomic', 'metagenomics', 'microbiome', 'args', 'benchmark', 'regions', 'resistance', 'microbial', 'usa', 'pubmed']
topic_7:  ['assembly', 'coverage', 'graph', 'set', 'illumina', 'distance', 'bias', 'forensic', 'ajb', 'ion']

level3

topic_0:  ['variant', 'read', 'genome', 'aligner', 'resfinder', 'mappings', 'resistance', 'mapq', 'reference', 'reads']
topic_1:  ['genomes', 'assembly', 'reads', 'using', 'genome', 'mash', 'benchmark', 'regions', 'variants', 'coverage']
topic_2:  ['data', 'clustering', 'cell', 'cells', 'normalization', 'genes', 'methods', 'expression', 'number', 'used']
topic_3:  ['data', 'coverage', 'learning', 'genome', 'deep', 'bias', 'sequencing', 'genomics', 'illumina', 'human']
topic_4:  ['read', 'alignment', 'reads', 'genome', 'reference', 'sequencing', 'algorithms', 'bioinformatics', 'dna', 'scone']
topic_5:  ['sequencing', 'species', 'data', 'wgs', 'reads', 'lrs', 'variant', 'depth', 'using', 'kraken']
topic_6:  ['umap', 'data', 'usa', 'university', 'fuzzy', 'author', 'set', 'qiime', 'manuscript', 'manifold']
topic_7:  ['analysis', 'data', 'cell', 'cells', 'methods', 'sequencing', 'gatk', 'expression', 'gene', 'genome']
topic_8:  ['snps', 'drosophila', 'variant', 'amino', 'megares', 'gene', 'snpeff', 'variants', 'args', 'protein']
topic_9:  ['cells', 'data', 'args', 'resistance', 'scvis', 'resistome', 'cell', 'dataset', 'bipolar', 'clusters']
topic_10:  ['cancer', 'dna', 'data', 'sequencing', 'analysis', 'tumor', 'detection', 'pubmed', 'liquid', 'circulating']
topic_11:  ['reads', 'graph', 'assembly', 'ajb', 'contigs', 'bruijn', 'distance', 'spades', 'genome', 'edge']
topic_12:  ['metagenomic', 'analysis', 'data', 'sequencing', 'metagenomics', 'microbiome', 'used', 'microbial', 'dna', 'pubmed']
topic_13:  ['data', 'nat', 'methods', 'integration', 'analysis', 'cell', 'reduction', 'sequencing', 'cells', 'joint']
topic_14:  ['sequencing', 'dna', 'ngs', 'cancer', 'analysis', 'data', 'genome', 'forensic', 'technology', 'variant']
topic_15:  ['kraken', 'reference', 'data', 'sequences', 'genes', 'used', 'lineage', 'sequence', 'genomes', 'genome']

Figure 3. Quality metrics across training iterations for hierarchical model - bioinformatics dataset

Results for level 0

Sparsity Phi: 0.381
Sparsity Theta: 0.000
Kernel contrast: 0.891
Kernel purity: 0.938

Results for level 1

Sparsity Phi: 0.567
Sparsity Theta: 0.000
Kernel contrast: 0.846
Kernel purity: 0.890

Results for level 2

Sparsity Phi: 0.735
Sparsity Theta: 0.009
Kernel contrast: 0.839
Kernel purity: 0.879

Results for level 3

Sparsity Phi: 0.000
Sparsity Theta: 0.000
Kernel contrast: 0.461
Kernel purity: 0.388

Figure 4. Spectral clustering results - bioinformatics dataset

doc_names cluster_id max_p_topic_id
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 0 0
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 0 0
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 0 0
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 0 0
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 0 0
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 0 0
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 0
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 0 0
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 0 0
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 0 0
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 0 0
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 0 0
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 0 0
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 0 0
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 0 0
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 0 0
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 0 0
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 0 0
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 0 0
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 0 0
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 0 0
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 0 0
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 0 0
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 0 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 0 0
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 0 0
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 0 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 1 1
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 1 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 1 1
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 1 1
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 1 1
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 1 1
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 1 1
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 1 1
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 1 1
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 1 1
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 1 1
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 1 1
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 1 1
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 1 1
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 1 1
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 1 1
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 1
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 1 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 1 1
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 1 1
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 1 1
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 1 1
doc_names cluster_id max_p_topic_id
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 0 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 0 2
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 0 2
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 0 2
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 0 2
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 2
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 0 2
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 0 2
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 0 2
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 0 2
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 0 2
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 0 2
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 0 2
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 0 2
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 0 3
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 0 3
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 1 0
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 1 0
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 1 0
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 1 0
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 1 0
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 1 0
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 1 0
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 1 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 1 0
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 1 0
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 2 1
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 2 1
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 2 1
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 2 1
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 2 1
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 2 1
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 2 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 2 1
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 2 1
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 2 1
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 2 1
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 3 3
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 3 3
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 3 3
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 3 3
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 3 3
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 3 3
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 3 3
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 3 3
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 3 3
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 3 3
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 3 3
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 3 3
doc_names cluster_id max_p_topic_id
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 0 4
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 0 4
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 0 4
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 0 4
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 1 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 1 1
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 1 1
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 1 1
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 1 1
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 1 1
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 1
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 1 1
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 1 6
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 2 0
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 2 0
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 2 0
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 2 0
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 2 0
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 3 3
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 3 5
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 3 5
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 3 5
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 3 5
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 4 3
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 4 3
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 4 3
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 4 3
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 5 6
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 5 7
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 5 7
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 5 7
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 5 7
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 5 7
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 5 7
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 6 2
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 6 2
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 6 2
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 6 2
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 6 2
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 6 2
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 6 2
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 7 6
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 7 6
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 7 6
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 7 6
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 7 6
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 7 6
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 7 6
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 7 6
doc_names cluster_id max_p_topic_id
7 Bonin_et_al._-_2023_-_MEGARes_and_AMR++,_v3.0_an_updated_comprehensive_ 0 8
11 Cingolani_et_al._-_2012_-_A_program_for_annotating_and_predicting_the_effect 0 8
2 Amarasinghe_et_al._-_2020_-_Opportunities_and_challenges_in_long-read_sequenci 1 11
4 Bankevich_et_al._-_2012_-_SPAdes_A_New_Genome_Assembly_Algorithm_and_Its_Ap 1 11
20 Ibañez-Lligoña_et_al._-_2023_-_Bioinformatic_Tools_for_NGS-Based_Metagenomics_to_ 1 11
39 Rodríguez-Brazzarola_et_al._-_2018_-_Analyzing_the_Differences_Between_Reads_and_Contig 1 11
6 Bolyen_et_al._-_2019_-_Reproducible,_interactive,_scalable_and_extensible 2 6
27 McInnes_et_al._-_2020_-_UMAP_Uniform_Manifold_Approximation_and_Projectio 2 6
31 Navgire_et_al._-_2022_-_Analysis_and_Interpretation_of_metagenomics_data_ 3 0
25 Liu_et_al._-_2021_-_A_practical_guide_to_amplicon_and_metagenomic_anal 3 12
30 Nam_et_al._-_2023_-_Metagenomics_An_Effective_Approach_for_Exploring_ 3 12
41 Roumpeka_et_al._-_2017_-_A_Review_of_Bioinformatics_Tools_for_Bio-Prospecti 3 12
22 Li_-_2014_-_Toward_better_understanding_of_artifacts_in_varian 4 14
42 Satam_et_al._-_2023_-_Next-Generation_Sequencing_Technology_Current_Tre 4 14
46 Yang_et_al._-_2014_-_Application_of_Next-generation_Sequencing_Technolo 4 14
16 Feldgarden_et_al._-_2021_-_AMRFinderPlus_and_the_Reference_Gene_Catalog_facil 5 15
29 Mineeva_et_al._-_2020_-_DeepMAsED_evaluating_the_quality_of_metagenomic_a 5 15
36 O’Toole_et_al._-_2021_-_Assignment_of_epidemiological_lineages_in_an_emerg 5 15
45 Wood_et_al._-_2019_-_Improved_metagenomic_analysis_with_Kraken_2 5 15
18 Flynn_et_al._-_2023_-_Single-Cell_Multiomics 6 13
15 Ewels_et_al._-_2016_-_MultiQC_summarize_analysis_results_for_multiple_t 7 7
28 McKenna_et_al._-_2010_-_The_Genome_Analysis_Toolkit_A_MapReduce_framework 7 7
32 Nayak_and_Hasija_-_2021_-_A_hitchhiker's_guide_to_single-cell_transcriptomic 7 7
12 Cole_et_al._-_2019_-_Performance_Assessment_and_Selection_of_Normalizat 8 2
24 Lin_et_al._-_2017_-_Using_neural_networks_for_reducing_the_dimensions_ 8 2
47 Zhang_et_al._-_2023_-_Review_of_single-cell_RNA-seq_data_clustering_for_ 8 2
13 Danecek_et_al._-_2021_-_Twelve_years_of_SAMtools_and_BCFtools 9 5
21 Kishikawa_et_al._-_2019_-_Empirical_evaluation_of_variant_calling_accuracy_u 9 5
26 Lu_et_al._-_2017_-_Bracken_estimating_species_abundance_in_metagenom 9 5
33 Oehler_et_al._-_2023_-_The_application_of_long-read_sequencing_in_clinica 9 5
8 Brockley_et_al._-_2023_-_Sequence-Based_Platforms_for_Discovering_Biomarker 10 10
10 Chen_et_al._-_2018_-_fastp_an_ultra-fast_all-in-one_FASTQ_preprocessor 10 10
19 Hadfield_et_al._-_2018_-_Nextstrain_real-time_tracking_of_pathogen_evoluti 10 10
38 Roberto_et_al._-_2023_-_Strategies_for_improving_detection_of_circulating_ 10 10
48 Zhou_et_al._-_2018_-_GrapeTree_visualization_of_core_genomic_relations 10 10
3 Anders_and_Huber_-_2010_-_Differential_expression_analysis_for_sequence_coun 11 3
9 Caudai_et_al._-_2021_-_AI_applications_in_functional_genomics 11 3
40 Ross_et_al._-_2013_-_Characterizing_and_measuring_bias_in_sequence_data 11 3
14 Ding_et_al._-_2018_-_Interpretable_dimensionality_reduction_of_single_c 12 9
35 O’Connor_and_Heyderman_-_2023_-_The_challenges_of_defining_the_human_nasopharyngea 12 9
17 Florensa_et_al._-_2022_-_ResFinder_–_an_open_online_resource_for_identifica 13 0
44 Wilton_and_Szalay_-_2023_-_Short-read_aligner_performance_in_germline_variant 13 0
0 Alneberg_et_al._-_2014_-_Binning_metagenomic_contigs_by_coverage_and_compos 14 1
5 Bertrand_et_al._-_2019_-_Hybrid_metagenomic_assembly_enables_high-resolutio 14 1
34 Ondov_et_al._-_2016_-_Mash_fast_genome_and_metagenome_distance_estimati 14 1
43 Wagner_et_al._-_2022_-_Benchmarking_challenging_small_variants_with_linke 14 1
1 Alser_et_al._-_2021_-_Technology_dictates_algorithms_recent_development 15 4
23 Li_and_Durbin_-_2009_-_Fast_and_accurate_short_read_alignment_with_Burrow 15 4
37 Reinert_et_al._-_2015_-_Alignment_of_Next-Generation_Sequencing_Reads 15 4

Figure 5. Topic hierarchy structure - bioinformatics dataset

Results for NLP set

CPU times: user 1min 5s, sys: 18.7 s, total: 1min 23s
Wall time: 29.9 s
level0

topic_0:  ['data', 'network', 'social', 'used', 'learning', 'spam', 'information', 'embedding', 'networks', 'research']
topic_1:  ['model', 'proceedings', 'conference', 'information', 'learning', 'knowledge', 'text', 'language', 'data', 'methods']

level1

topic_0:  ['data', 'network', 'embedding', 'information', 'news', 'graph', 'clustering', 'learning', 'models', 'networks']
topic_1:  ['clinical', 'social', 'patent', 'text', 'model', 'classification', 'data', 'spam', 'learning', 'detection']
topic_2:  ['knowledge', 'model', 'articles', 'tax', 'used', 'training', 'disease', 'set', 'research', 'quality']
topic_3:  ['proceedings', 'conference', 'language', 'extraction', 'computational', 'knowledge', 'association', 'linguistics', 'learning', 'methods']

level2

topic_0:  ['clustering', 'areas', 'topic', 'institutions', 'recommendation', 'terms', 'thematic', 'technology', 'performance', 'recommendations']
topic_1:  ['model', 'text', 'information', 'used', 'models', 'using', 'clinical', 'conference', 'classification', 'language']
topic_2:  ['event', 'sket', 'reports', 'pathology', 'engineering', 'events', 'argument', 'cancer', 'topics', 'archetype']
topic_3:  ['articles', 'example', 'dream', 'citation', 'training', 'article', 'prefiltering', 'traf', 'trial', 'sampling']
topic_4:  ['knowledge', 'proceedings', 'extraction', 'conference', 'computational', 'language', 'methods', 'entity', 'relation', 'concept']
topic_5:  ['social', 'spam', 'tax', 'detection', 'features', 'twitter', 'techniques', 'users', 'accounts', 'cases']
topic_6:  ['patent', 'questions', 'question', 'problem', 'patents', 'modeling', 'class', 'study', 'problems', 'classification']
topic_7:  ['network', 'networks', 'graph', 'embedding', 'nodes', 'node', 'disease', 'representation', 'gcn', 'drug']

level3

topic_0:  ['areas', 'institutions', 'recommendation', 'thematic', 'recommendations', 'system', 'set', 'collaboration', 'technology', 'institution']
topic_1:  ['example', 'dream', 'house', 'dreams', 'situation', 'reports', 'flying', 'situations', 'falling', 'groups']
topic_2:  ['political', 'model', 'text', 'classification', 'detection', 'work', 'seed', 'label', 'data', 'policy']
topic_3:  ['construction', 'research', 'data', 'text', 'argument', 'analysis', 'nlp', 'documents', 'media', 'mining']
topic_4:  ['proceedings', 'conference', 'extraction', 'learning', 'computational', 'information', 'language', 'association', 'word', 'methods']
topic_5:  ['data', 'news', 'clustering', 'model', 'set', 'methods', 'online', 'models', 'patent', 'problem']
topic_6:  ['data', 'clinical', 'clustering', 'trial', 'emr', 'patient', 'patients', 'vector', 'medical', 'trials']
topic_7:  ['patent', 'question', 'questions', 'word', 'words', 'summarization', 'based', 'model', 'information', 'data']
topic_8:  ['knowledge', 'entity', 'resolution', 'subjectivity', 'methods', 'tax', 'concept', 'entities', 'semantic', 'anaphora']
topic_9:  ['clinical', 'knowledge', 'concept', 'argumentative', 'literature', 'inform', 'mining', 'med', 'disease', 'learning']
topic_10:  ['model', 'articles', 'data', 'topic', 'training', 'citation', 'article', 'used', 'research', 'topics']
topic_11:  ['model', 'models', 'medical', 'clinical', 'bert', 'text', 'biomedical', 'classification', 'language', 'embeddings']
topic_12:  ['learning', 'network', 'embedding', 'graph', 'networks', 'node', 'nodes', 'data', 'information', 'representation']
topic_13:  ['event', 'social', 'lockdown', 'class', 'ratio', 'learning', 'data', 'media', 'events', 'distancing']
topic_14:  ['spam', 'social', 'detection', 'features', 'patent', 'classification', 'learning', 'dataset', 'used', 'text']
topic_15:  ['sket', 'reports', 'pathology', 'data', 'disease', 'cancer', 'network', 'concepts', 'networks', 'fication']

Figure 6. Quality metrics across training iterations for hierarchical model - NLP dataset

Results for level 0

Sparsity Phi: 0.373
Sparsity Theta: 0.000
Kernel contrast: 0.871
Kernel purity: 0.914

Results for level 1

Sparsity Phi: 0.570
Sparsity Theta: 0.000
Kernel contrast: 0.827
Kernel purity: 0.780

Results for level 2

Sparsity Phi: 0.686
Sparsity Theta: 0.000
Kernel contrast: 0.816
Kernel purity: 0.787

Results for level 3

Sparsity Phi: 0.000
Sparsity Theta: 0.000
Kernel contrast: 0.471
Kernel purity: 0.402

Figure 7. Spectral clustering results - NLP dataset

doc_names cluster_id max_p_topic_id
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 0 1
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 0 1
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 0 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 0 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 0 1
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 0 1
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 0 1
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 0 1
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 0 1
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 0 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 0 1
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 0 1
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 0 1
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 0 1
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 0 1
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 0 1
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 0 1
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 0 1
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 0 1
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 0 1
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 0 1
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 0 1
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 0 1
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 0 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 0 1
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 1 0
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 1 0
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 1 0
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 1 0
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 1 0
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 1 0
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 1 0
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 1 0
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 1 0
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 1 0
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 1 0
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 1 0
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 1 0
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 1 0
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 1 0
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 1 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 1 0
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 1 0
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 1 0
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 1 0
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 1 0
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 1 0
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 1 0
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 1 0
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 1 0
doc_names cluster_id max_p_topic_id
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 0 2
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 0 3
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 0 3
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 0 3
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 0 3
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 0 3
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 0 3
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 0 3
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 1 0
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 1 0
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 1 0
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 1 0
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 1 0
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 1 0
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 1 0
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 1 0
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 1 0
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 1 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 1 0
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 1 0
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 1 0
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 1 0
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 1 0
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 1 0
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 1 0
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 1 2
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 2 2
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 2 2
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 2 2
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 2 2
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 2 2
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 2 2
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 2 2
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 2 2
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 3 1
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 3 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 3 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 3 1
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 3 1
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 3 1
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 3 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 3 1
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 3 1
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 3 1
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 3 1
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 3 1
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 3 1
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 3 1
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 3 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 3 1
doc_names cluster_id max_p_topic_id
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 0 1
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 0 1
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 0 1
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 0 1
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 0 1
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 0 1
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 0 1
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 1 4
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 1 4
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 1 4
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 1 4
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 2 1
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 2 2
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 2 2
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 2 2
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 2 2
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 2 2
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 2 2
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 2 2
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 2 2
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 2 7
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 3 5
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 3 5
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 3 5
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 3 5
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 3 5
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 3 5
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 3 5
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 3 5
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 4 0
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 4 0
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 4 0
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 4 0
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 4 0
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 4 0
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 5 7
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 5 7
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 5 7
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 5 7
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 5 7
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 5 7
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 6 3
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 6 3
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 6 3
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 6 3
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 7 6
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 7 6
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 7 6
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 7 6
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 7 6
doc_names cluster_id max_p_topic_id
7 De_Clercq_et_al._-_2019_-_Multi-label_classification_and_interactive_NLP-bas 0 7
21 Jain_et_al._-_2021_-_Summarization_of_legal_documents_Where_are_we_now 0 7
22 Jain_et_al._-_2023_-_Bayesian_Optimization_based_Score_Fusion_of_Lingui 0 7
36 Othman_et_al._-_2019_-_Enhancing_Question_Retrieval_in_Community_Question 0 7
38 Pisaneschi_et_al._-_2023_-_Automatic_generation_of_scientific_papers_for_data 0 7
25 Lathabai_et_al._-_2022_-_Institutional_collaboration_recommendation_An_exp 1 0
11 Doan_and_Gulla_-_2022_-_A_Survey_on_Political_Viewpoints_Identification 2 2
48 Zhao_et_al._-_2023_-_Weak-PMLC_A_large-scale_framework_for_multi-label 2 2
4 Bondielli_and_Marcelloni_-_2021_-_On_the_use_of_summarization_and_transformer_archit 3 5
5 Curiskis_et_al._-_2020_-_An_evaluation_of_document_clustering_and_topic_mod 3 5
16 Giordano_et_al._-_2023_-_Unveiling_the_inventive_process_from_patents_by_ex 3 5
43 Timmerman_and_Bronselaer_-_2022_-_Automated_monitoring_of_online_news_accuracy_with_ 3 5
17 Gutman_Music_et_al._-_2022_-_Mapping_dreams_in_a_computational_space_A_phrase- 4 1
9 Dhayne_et_al._-_2021_-_EMR2vec_Bridging_the_gap_between_patient_data_and 5 6
20 Ilievski_et_al._-_2020_-_The_role_of_knowledge_in_determining_identity_of_l 5 6
24 Kumar_and_III_-_2011_-_A_Co-training_Approach_for_Multi-view_Spectral_Clu 5 6
33 May_et_al._-_2022_-_Applying_Natural_Language_Processing_in_Manufactur 5 6
14 García_del_Valle_et_al._-_2019_-_Disease_networks_and_their_contribution_to_disease 6 15
30 López-Úbeda_et_al._-_2022_-_Natural_Language_Processing_in_Pathology_Current_ 6 15
32 Marchesin_et_al._-_2022_-_Empowering_digital_pathology_applications_through_ 6 15
3 Baek_et_al._-_2021_-_A_critical_review_of_text-based_research_in_constr 7 3
15 García-Díaz_et_al._-_2020_-_Ontology-driven_aspect-based_sentiment_analysis_cl 7 3
29 Lytos_et_al._-_2019_-_The_evolution_of_argumentation_mining_From_models 7 3
45 Wang_et_al._-_2022_-_Deep_learning_modeling_of_public’s_sentiments_towa 7 3
6 D'Ercole_et_al._-_2022_-_Classifying_news_articles_in_multiple_languages_l 8 10
13 Fuenteslópez_et_al._-_2023_-_Biomaterials_text_mining_A_hands-on_comparative_s 8 10
28 Lupi_et_al._-_2023_-_Automatic_definition_of_engineer_archetypes_A_tex 8 10
34 Medić_and_Šnajder_-_2022_-_An_empirical_study_of_the_design_choices_for_local 8 10
49 Zulkarnain_and_Putri_-_2021_-_Intelligent_transportation_systems_(ITS)_A_system 8 10
0 Accuosto_and_Saggion_-_2020_-_Mining_arguments_in_scientific_abstracts_with_disc 9 9
12 Fu_et_al._-_2020_-_Clinical_concept_extraction_A_methodology_review 9 9
39 Pérez-Pérez_et_al._-_2023_-_A_novel_gluten_knowledge_base_of_potential_biomedi 9 9
8 Detroja_et_al._-_2023_-_A_survey_on_Relation_Extraction 10 4
35 Oral_et_al._-_2020_-_Information_Extraction_from_Text_Intensive_and_Vis 10 14
31 Mao_et_al._-_2024_-_A_survey_on_semantic_processing_techniques 11 8
42 Strąk_and_Tuszyński_-_2020_-_Quantitative_analysis_of_a_private_tax_rulings_cor 11 8
44 Wang_et_al._-_2021_-_Knowledge_graph_quality_control_A_survey 11 8
18 Haneczok_and_Piskorski_-_2020_-_Shallow_and_deep_learning_for_event_relatedness_cl 12 13
23 Jáñez-Martino_et_al._-_2023_-_Classifying_spam_emails_using_agglomerative_hierar 12 13
26 Li_et_al._-_2021_-_Can_social_media_data_be_used_to_evaluate_the_risk 12 13
1 Amara_et_al._-_2021_-_Network_representation_learning_systematic_review 13 12
10 Di_Girolamo_et_al._-_2021_-_Evolutionary_game_theoretical_on-line_event_detect 13 12
37 Paolanti_and_Frontoni_-_2020_-_Multidisciplinary_Pattern_Recognition_applications 13 12
47 Zhao_et_al._-_2021_-_Entropy-aware_self-training_for_graph_convolutiona 13 12
40 Rao_et_al._-_2021_-_A_review_on_social_spam_detection_Challenges,_ope 14 14
41 Ruijie_et_al._-_2021_-_Patent_text_modeling_strategy_and_its_classificati 14 14
2 Babaiha_et_al._-_2023_-_A_natural_language_processing_system_for_the_effic 15 11
19 Harnoune_et_al._-_2021_-_BERT_based_clinical_knowledge_extraction_for_biome 15 11
27 Li_et_al._-_2022_-_Neural_Natural_Language_Processing_for_unstructure 15 11
46 Zangari_et_al._-_2023_-_Ticket_automation_An_insight_into_current_researc 15 11
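
The tables above pair each document with a cluster label and with the id of its most probable topic. A minimal sketch of how such a table could be assembled with pandas is given below. It is an illustration under stated assumptions, not the script used for this summary: the topic-by-document probability matrix `theta`, the toy document names, the `cluster_ids` list, and the helper `assignment_table` are all hypothetical.

```python
import pandas as pd


def assignment_table(theta: pd.DataFrame, cluster_ids) -> pd.DataFrame:
    """Build a doc_names / cluster_id / max_p_topic_id table.

    theta       -- assumed layout: one row per topic, one column per document,
                   each cell holding p(topic | document)
    cluster_ids -- one cluster label per document, in column order
    """
    table = pd.DataFrame(
        {
            "doc_names": list(theta.columns),
            "cluster_id": list(cluster_ids),
            # position of the most probable topic for every document
            "max_p_topic_id": theta.to_numpy().argmax(axis=0),
        }
    )
    # A stable sort keeps the original document order inside each
    # (cluster_id, max_p_topic_id) group, as in the printed tables.
    return table.sort_values(["cluster_id", "max_p_topic_id"], kind="stable")


# Toy data (hypothetical): 3 topics x 4 documents.
theta = pd.DataFrame(
    [
        [0.7, 0.1, 0.2, 0.6],
        [0.2, 0.8, 0.3, 0.3],
        [0.1, 0.1, 0.5, 0.1],
    ],
    columns=["doc_a", "doc_b", "doc_c", "doc_d"],
)
cluster_ids = [1, 0, 0, 1]

table = assignment_table(theta, cluster_ids)
print(table.to_string())

# Cross-tabulation: how many documents of each cluster share a dominant topic.
agreement = pd.crosstab(table["cluster_id"], table["max_p_topic_id"])
print(agreement)
```

The cross-tabulation at the end is one possible way to read the tables: a cluster whose documents fall mostly into a single `max_p_topic_id` column agrees closely with the topic model at that hierarchy level.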

Figure 8: Topic hierarchy structure - NLP dataset